PLEASE SEE ALL TABS :)
Item quantity, name and category are null
Jumpman arrival time at the pickup location is before the delivery start time
Jumpman arrived-at and left pickup times are null
Place Category is null
How Long It Took To Order is null
Description of the issue: Some delivery rows have an “N/A” value for the item quantity, category and name.
Percent of affected records: 20.5%
Why is this an issue: Sending a Jumpman to a pickup location with no items to deliver undermines the value of the application.
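The share of affected records can be computed directly from the missing-value mask. The following is a minimal sketch on a toy data frame; the column name `item_quantity` and the use of R's `NA` for the “N/A” cells are assumptions about how the data was loaded.

```r
# Toy stand-in for the delivery data (hypothetical values).
# Assumes "N/A" cells were read in as NA.
deliveries <- data.frame(
  delivery_id   = 1:5,
  item_quantity = c(1, NA, 2, NA, 1)
)

# mean() of a logical vector is the share of TRUE values,
# so this is the percent of rows with a missing item quantity.
pct_missing <- 100 * mean(is.na(deliveries$item_quantity))
pct_missing
```

If the file stores the literal string "N/A" instead, the mask would be `deliveries$item_quantity == "N/A"`.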
Item quantity N/A example image
ggplot() + geom_density(data = numbered_quantity, aes(x=time_to_deliver, fill= "Number"), adjust = 1, alpha=.3) + geom_density(data = empty_quantity, aes(x=time_to_deliver, fill="N/A"), adjust = 1, alpha=.3) + jhilliard_theme + scale_fill_manual(name="Item Quantity", values=c(Number="#56B4E9", "N/A"="#E69F00")) + labs(x="Minutes", y="Density", title="Time to Complete Delivery")+ xlim(0, 125)
ggplot() + geom_density(data = numbered_quantity, aes(x=distance_haversine_in_miles, fill= "Number"), adjust = 1, alpha=.3) + geom_density(data = empty_quantity, aes(x=distance_haversine_in_miles, fill="N/A"), adjust = 1, alpha=.3) + jhilliard_theme + scale_fill_manual(name="Item Quantity", values=c(Number="#56B4E9", "N/A"="#E69F00")) + labs(x="Miles", y="Density", title="Delivery Distance")
one.sample.z(empty_quantity$time_to_deliver, mean(numbered_quantity$time_to_deliver), sigma = sd(numbered_quantity$time_to_deliver))
##
## One sample z-test
## z* P-value
## -10.17845 mins 2.474603e-24 mins
one.sample.z(empty_quantity$distance_haversine_in_miles, mean(numbered_quantity$distance_haversine_in_miles), sigma = sd(numbered_quantity$distance_haversine_in_miles))
##
## One sample z-test
## z* P-value
## -9.84389 7.283858e-23
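For readers unfamiliar with the test, here is a rough sketch of what a one-sample z-test computes. `z_test_sketch` is a hypothetical helper written for illustration, not the `one.sample.z()` function called above, and it assumes a two-sided alternative.

```r
# Hypothetical helper: one-sample z-test with a known sigma.
# x       : sample values (here, the rows with N/A item quantity)
# null_mu : mean under the null hypothesis (here, the mean of the other rows)
# sigma   : standard deviation assumed known (here, the sd of the other rows)
z_test_sketch <- function(x, null_mu, sigma) {
  x <- as.numeric(x[!is.na(x)])            # drop missing values, strip difftime units
  z <- (mean(x) - null_mu) / (sigma / sqrt(length(x)))
  p <- 2 * pnorm(-abs(z))                  # two-sided p-value
  c(z = z, p_value = p)
}

# Toy example: sample mean 10, null mean 12, sigma 2, n = 4
res <- z_test_sketch(c(9, 10, 11, 10), null_mu = 12, sigma = 2)
res
```

A large negative z, as in both outputs above, means the N/A rows have a lower average than the rows with a numeric item quantity.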
The issue does not stem from particular customers, jumpmen, pickup places, vehicle types or place categories, as none of their values are unique to rows with empty item quantities.
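One way to run this kind of check is to look, column by column, for values that appear only in the rows with empty item quantities. The sketch below uses two toy data frames; the column names and values are hypothetical.

```r
# Toy stand-ins for numbered_quantity and empty_quantity (hypothetical values).
numbered <- data.frame(
  jumpman_id   = c(1, 2, 3),
  vehicle_type = c("bicycle", "car", "walker")
)
empty <- data.frame(
  jumpman_id   = c(2, 3),
  vehicle_type = c("bicycle", "car")
)

# For each column, list the values that occur only in the empty-quantity rows.
only_in_empty <- lapply(names(empty), function(col) {
  setdiff(empty[[col]], numbered[[col]])
})
names(only_in_empty) <- names(empty)

# A length of 0 for every column means no value is unique to the empty rows.
lengths(only_in_empty)
```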
There doesn’t seem to be an issue with pickup and drop-off locations either: rows with and without an item quantity are both spread throughout the city.
ny <- get_map(location = c(lon = -73.972026, lat = 40.745362), zoom = 12)
pickup_num <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = pickup_lon, y = pickup_lat), size = .3, data = numbered_quantity) + labs(title= "Normal Pickup Locations")
pickup_empty <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = pickup_lon, y = pickup_lat), size = .3, data = empty_quantity) + labs(title= "N/A Item Quantity Pickup Locations")
grid.arrange(pickup_num, pickup_empty, nrow=1, ncol=2)
dropoff_num <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = dropoff_lon, y = dropoff_lat), size = .3, data = numbered_quantity) + labs(title= "Normal Dropoff Locations")
dropoff_empty <- ggmap(ny, extent = "device", legend = "topleft") + geom_point( aes(x = dropoff_lon, y = dropoff_lat), size = .3, data = empty_quantity) + labs(title= "N/A Item Quantity Dropoff Locations")
grid.arrange(dropoff_num, dropoff_empty, nrow=1, ncol=2)
Description of the issue: Some delivery rows have an “N/A” value for the time the Jumpman arrived at the pickup location.
Percent of affected records: 9.2%
Why is this an issue: Without the time spent at a pickup location, wait times at pickup locations become unpredictable, and it becomes difficult to calculate which pickup locations are the most profitable.
Arrived and left pick-up time example image
median(pickup_time$distance_haversine_in_miles)
## [1] 0.8664454
median(no_pickup_time$distance_haversine_in_miles)
## [1] 0.610389
median(pickup_time$time_to_deliver)
## Time difference of 42.45873 mins
median(no_pickup_time$time_to_deliver)
## Time difference of 46.2639 mins
one.sample.z(no_pickup_time$time_to_deliver, mean(pickup_time$time_to_deliver), sigma = sd(pickup_time$time_to_deliver))
##
## One sample z-test
## z* P-value
## 4.190639 mins 2.781696e-05 mins
one.sample.z(no_pickup_time$distance_haversine_in_miles, mean(pickup_time$distance_haversine_in_miles), sigma = sd(pickup_time$distance_haversine_in_miles))
##
## One sample z-test
## z* P-value
## -4.206264 2.596265e-05
# Both p-values are far below 0.05, so rows without a pickup time differ significantly from the rest in delivery time and distance
ny <- get_map(location = c(lon = -73.972026, lat = 40.745362), zoom = 13)
no_checkin <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = pickup_lon, y = pickup_lat, alpha = ..level..), size = 2, bins = 4, data = no_pickup_time, geom = "polygon") + labs(title= "Density of Pickup Location \n with No Pickup Checkin Time")
# both panels are titled as pickup-location densities, so both plot the pickup coordinates
with_checkin <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = pickup_lon, y = pickup_lat, alpha = ..level..), size = 2, bins = 4, data = pickup_time, geom = "polygon") + labs(title= "Density of Pickup Location \n with Pickup Checkin Time")
grid.arrange(no_checkin, with_checkin, nrow=1, ncol=2)
See: Other issues not broken down
Pickup time before order time example image
As this is a pre-interview assignment, I am not going to go in depth on the other 3 issues that I found with the data.
BIG ONE: JUMPMAN ARRIVES AT THE PICKUP LOCATION BEFORE THE DELIVERY IS STARTED
Place Category is null
How Long It Took To Order is null
While deliveries per day and unique customers ordering per day are growing, customer acquisition is declining. This suggests that new customers are being retained and are ordering more than once.
So that the “dirty” rows do not affect the findings of this analysis, I have removed from the data set the rows with pickup times that are before order times and the rows with empty item quantity values.
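The cleaning step described here can be sketched as a single logical filter. The column names (`item_quantity`, `order_time`, `pickup_time`) and the toy timestamps below are assumptions for illustration, not the actual schema.

```r
# Toy stand-in for the raw data (hypothetical values).
raw <- data.frame(
  item_quantity = c(1, NA, 2, 3),
  order_time    = as.POSIXct(c("2020-01-01 18:00:00", "2020-01-01 18:05:00",
                               "2020-01-01 18:10:00", "2020-01-01 18:20:00"), tz = "UTC"),
  pickup_time   = as.POSIXct(c("2020-01-01 18:15:00", "2020-01-01 18:20:00",
                               "2020-01-01 18:00:00", NA), tz = "UTC")
)

# Keep a row only if it has an item quantity and its pickup time
# is not earlier than its order time.
keep <- !is.na(raw$item_quantity) &
  (is.na(raw$pickup_time) | raw$pickup_time >= raw$order_time)

cleaned <- raw[keep, ]
nrow(cleaned)
```

In this sketch, rows with a missing pickup time are kept, since the text only mentions removing the two kinds of rows above.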
ggplot(order_by_day_no_dup_delivery_id, aes(x = as.Date(date_no_time), y = count)) + geom_point() + geom_smooth(colour="darkgoldenrod1", size=1.5, method="loess", se=FALSE) + geom_smooth(method="lm", se=FALSE) + labs(x="Date", y="Number of Deliveries", title="Number of Deliveries per Day")+ jhilliard_theme
Deliveries per day have grown modestly since the NYC market opened. There are peaks that correspond with Sundays.
pickup <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = pickup_lon, y = pickup_lat, alpha = ..level..), size = 2, bins = 4, data = jumpman_data_cleaned, geom = "polygon") + labs(title= "Density of Pickup location")
dropoff <- ggmap(ny, extent = "device", legend = "topleft") + stat_density2d( aes(x = dropoff_lon, y = dropoff_lat, alpha = ..level..), size = 2, bins = 4, data = jumpman_data_cleaned, geom = "polygon") + labs(title= "Density of Dropoff Locations")
grid.arrange(pickup, dropoff, nrow=1, ncol=2)
Pickup locations are more concentrated than drop-off locations. Customers tend to order from places in the East Village and Lower Manhattan, but orders get delivered to a much broader area. This makes sense, as the East Village has a large concentration of shops and restaurants. One interesting point is the drop-off density on the Upper East Side, which MAY imply that the customer market segment tends to have higher incomes.
ggplot(na.omit(jumpman_day_hour), aes(x = hour_ordered, y = day_ordered, fill=count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 10)) +
labs(x = "Hour of Request", y = "Day of Week of Request", title = "# of Delivery Requests in NYC, by Day and Time of Request") +
scale_fill_gradient(low = "white", high = "#2980B9")
Orders tend to be concentrated around dinner time on all days of the week. Order times vary more on weekends, but still seem to be concentrated around dinner. Weekends have more orders than weekdays.
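The heatmap above needs one count per day-of-week and hour combination. A minimal sketch of building such a table from order timestamps follows; the `order_time` column and the toy values are assumptions, and the document does not show how `jumpman_day_hour` was actually built.

```r
# Toy order timestamps (hypothetical values).
requests <- data.frame(
  order_time = as.POSIXct(c("2020-01-04 18:10:00", "2020-01-04 18:40:00",
                            "2020-01-05 12:05:00"), tz = "UTC")
)

# Derive the two grouping keys used on the heatmap axes.
requests$day_ordered  <- weekdays(requests$order_time)
requests$hour_ordered <- as.integer(format(requests$order_time, "%H"))
requests$count        <- 1

# Sum the per-row counter within each day/hour cell.
day_hour <- aggregate(count ~ day_ordered + hour_ordered, data = requests, FUN = sum)
day_hour
```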
ggplot(data=jumpman_data_cleaned) +
geom_histogram(aes(x=distance_haversine_in_miles), fill="springgreen4", binwidth = .15) +
labs(x="Miles", y="Delivery Count", title="Distance from Pickup to Dropoff Location") +
# the vertical lines map colour, not fill, so the manual scale must be a colour scale
scale_colour_manual(name="Statistic", values=c("mean_line"="orange", "median_line"="purple")) +
jhilliard_theme +
geom_vline(aes(xintercept = mean(distance_haversine_in_miles), colour = "mean_line")) +
geom_vline(aes(xintercept = median(distance_haversine_in_miles), colour = "median_line"))
The distribution of pickup-to-drop-off distance is uni-modal. Distances tend to be under 2 miles, with a median of ~1 mile. Since some deliveries are further than 4 miles, it would be worth investigating the delivery methods used for those deliveries.
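The distances here come from the `distance_haversine_in_miles` column. The document does not show how that column was computed, so the function below is only a sketch of the standard haversine formula, with an assumed mean Earth radius of 3958.8 miles and a hypothetical function name.

```r
# Hypothetical helper: great-circle distance in miles between two points
# given in decimal degrees, using the haversine formula.
haversine_miles <- function(lat1, lon1, lat2, lon2, radius_miles = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_miles * asin(pmin(1, sqrt(a)))
}

# One degree of latitude along the same meridian
one_degree <- haversine_miles(40, -74, 41, -74)
one_degree
```

The function is vectorised, so it can be applied to whole pickup and drop-off coordinate columns at once.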
ggplot(data=jumpman_data_cleaned) +
geom_histogram(aes(x=time_to_deliver), fill="springgreen4", binwidth = 5) +
labs(x="Minutes", y="Delivery Count", title="Time to Complete the Delivery") +
# the vertical lines map colour, not fill, so the manual scale must be a colour scale
scale_colour_manual(name="Statistic", values=c("mean_line"="darkblue", "median_line"="purple")) +
jhilliard_theme +
geom_vline(aes(xintercept = mean(time_to_deliver_num), colour = "mean_line")) +
geom_vline(aes(xintercept = median(time_to_deliver_num), colour = "median_line"))
The distribution of delivery completion time is uni-modal with a median of ~45 minutes. Some deliveries took over an hour; it would be worth investigating why.
ggplot(data=pickup_location) +
geom_histogram(aes(x=time_to_pickup/60), binwidth = 5, fill="springgreen4") +
labs(x="Minutes", y="Delivery Count", title="Time to Reach Pickup Location") +
# the vertical lines map colour, not fill, so the manual scale must be a colour scale
scale_colour_manual(name="Statistic", values=c("mean_line"="darkblue", "median_line"="purple")) +
jhilliard_theme +
geom_vline(aes(xintercept = mean(time_to_pickup_num)/60, colour = "mean_line")) +
geom_vline(aes(xintercept = median(time_to_pickup_num)/60, colour = "median_line"))
The distribution of time to reach the pickup location is uni-modal with a median of ~20 minutes. The negative values in this column have been removed for this graph.
ggplot(customer_by_day, aes(x = as.Date(date_no_time), y = count)) + geom_point() + geom_smooth(colour="darkgoldenrod1", size=1.5, method="loess", se=FALSE) + geom_smooth(method="lm", se=FALSE) + labs(x="Date", y="Number of Customers", title="Number of Unique Customers who Ordered in a given Day") + jhilliard_theme
The number of unique customers ordering per day is growing, but at a modest rate. There are peaks that correspond with Sundays. BUT…
ggplot(customer_first_order_day, aes(x = as.Date(earliest_order_date), y = count)) + geom_point() + geom_smooth(colour="darkgoldenrod1", size=1.5, method="loess", se=FALSE) + geom_smooth(method="lm", se=FALSE) + labs(x="Date", y="Number of New Customers", title="First Time Customers Who Ordered per Day") +jhilliard_theme
It is important to note that new customer acquisition has gone down since the launch of the NYC market, which means orders from new customers have gone down. We can address this with a customer acquisition campaign.
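New customers per day can be counted by finding each customer's earliest order and tallying those dates. The sketch below uses toy data with hypothetical column names; it is one way to build a table shaped like `customer_first_order_day`, not necessarily how the original was built.

```r
# Toy order log (hypothetical values).
orders <- data.frame(
  customer_id = c(1, 1, 2, 3, 3),
  order_date  = as.Date(c("2020-01-01", "2020-01-03", "2020-01-01",
                          "2020-01-02", "2020-01-05"))
)

# Sort by customer and date, then keep the first row per customer:
# that row holds the customer's earliest order.
orders <- orders[order(orders$customer_id, orders$order_date), ]
first_orders <- orders[!duplicated(orders$customer_id), ]

# Count how many customers placed their first order on each day.
new_per_day <- as.data.frame(table(earliest_order_date = first_orders$order_date))
names(new_per_day)[2] <- "count"
new_per_day
```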
ggplot(data = daily_retention) + geom_line(aes(y=Prc, x=date_diff, group=date1), colour="grey") + geom_smooth(aes(y=Prc, x=date_diff), colour="darkgoldenrod1", size=1.5, se=FALSE) + labs(y="Percent Retained", x="Days Since Order", title="Likelihood of an Order N Days after a Separate Order \n from a Unique Customer") + jhilliard_theme
The expected likelihood of another order coming in n days after an order is ~3%. This means that if you buy 100 ordering customers through an acquisition campaign, it would not be unreasonable to expect this set of customers to generate ~3 returning customers per day (for the foreseeable future) after being initially acquired. While this doesn’t tell the whole customer retention story, it is a nice start.
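The arithmetic behind that estimate is a single multiplication. The sketch below spells it out, assuming the ~3% rate is roughly constant across days, as the smoothed curve suggests; the variable names are hypothetical.

```r
acquired_customers <- 100    # customers bought through the campaign
daily_return_rate  <- 0.03   # ~3% chance of an order on any given later day

# Expected number of those customers ordering again on a given day.
expected_returning_per_day <- acquired_customers * daily_return_rate
expected_returning_per_day
```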
I ignore the increasing trend at the end of the graph because it is likely caused by early-adopter bias.
Top 10 percent of pickup locations, as determined by number of items ordered.
map <- leaflet()
map <- addTiles(map)
order_place <- aggregate(cbind(count = item_quantity) ~ pickup_place + pickup_lat + pickup_lon,
data = jumpman_data_cleaned,
FUN = function(x){ NROW(x) })
order_place <- data.frame(order_place)
order_place <- order_place %>% rowwise() %>% mutate(popup = paste(pickup_place, toString(count), sep=" : ") )
order_place <- order_place[ which(order_place$count >= 21), ]
map <- addMarkers(map, lng=order_place$pickup_lon, lat=order_place$pickup_lat, popup=order_place$popup)
map
This map shows the top 10% of pickup places by number of items ordered. The user can click on a marker to get the name of the place and the number of items ordered from that location.
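The cutoff of 21 in the code above is hard-coded and presumably corresponds to the 90th percentile of the per-place counts. As a sketch, the threshold could instead be derived with `quantile()`, so it stays correct if the data changes. The toy counts and place names below are hypothetical.

```r
# Toy per-place counts (hypothetical values).
place_counts <- data.frame(
  pickup_place = LETTERS[1:10],
  count        = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
)

# 90th percentile of the counts; places at or above it are the top 10%.
threshold  <- quantile(place_counts$count, probs = 0.90)
top_places <- place_counts[place_counts$count >= threshold, ]
top_places
```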